Some definitions and notation

We consider strings in the alphabet {0, 1}, i.e., finite sequences of zeroes and ones in a 1-1 correspondence
with natural numbers:

$\Lambda \leftrightarrow 0$
$0 \leftrightarrow 1$
$1 \leftrightarrow 2$
$00 \leftrightarrow 3$
$01 \leftrightarrow 4$
$10 \leftrightarrow 5$
$11 \leftrightarrow 6$
$000 \leftrightarrow 7$
$001 \leftrightarrow 8$
$\ldots$

($\Lambda$ is the empty string). We do not distinguish strings and numbers and use the terms interchangeably. They
are usually denoted by lower-case Latin letters. The set of all strings-numbers is denoted by $S$. The result
of adding (concatenating) the string $y$ to the string $x$ is denoted by $xy$. We also need to encode the ordered
pair $(x, y)$ of strings by one string. To avoid introducing a special separator (such as a comma), let us agree
that for $x = x_1 x_2 \ldots x_n$ ($x_i \in \{0, 1\}$)

$\bar{x} = x_1 x_1 x_2 x_2 \ldots x_n x_n 01.$ (0.1)

Then one can recover both $x$ and $y$ from the string $\bar{x}y$. Denote by $\pi_1(z)$ and $\pi_2(z)$ functions such that
$\pi_1(\bar{x}y) = x$, $\pi_2(\bar{x}y) = y$. If the string $z$ is not representable as $\bar{x}y$ then $\pi_1(z) = \Lambda$, $\pi_2(z) = \Lambda$.
The length $l(x)$ of a string $x$ is the number of its digits; $l(\Lambda) = 0$. Obviously

$l(xy) = l(x) + l(y),$ (0.2)

$l(\bar{x}) = 2l(x) + 2.$ (0.3)
Let $d(A)$ be the number of elements in a set $A$. Obviously

$d\{x : l(x) = n\} = 2^n,$ (0.4)

$d\{x : l(x) < n\} = 2^n - 1.$ (0.5)
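The conventions above can be tried out in a few lines of Python. This is only an illustrative sketch: the function names (`num_to_str`, `str_to_num`, `bar`, `encode_pair`, `decode_pair`) are hypothetical and not from the text, and strings are represented as Python `str` objects over the characters `0` and `1`, with the empty string playing the role of the empty word.

```python
def num_to_str(n: int) -> str:
    """String paired with the number n: 0 -> '', 1 -> '0', 2 -> '1', 3 -> '00', ..."""
    return bin(n + 1)[3:]          # binary of n+1 with the leading '0b1' removed


def str_to_num(s: str) -> int:
    """Inverse of num_to_str."""
    return int("1" + s, 2) - 1


def bar(x: str) -> str:
    """Self-delimiting form (0.1): double every digit, then append '01'."""
    return "".join(c + c for c in x) + "01"


def encode_pair(x: str, y: str) -> str:
    """Code of the ordered pair (x, y) as the single string bar(x) + y."""
    return bar(x) + y


def decode_pair(z: str) -> tuple[str, str]:
    """Return (pi_1(z), pi_2(z)); both are empty if z is not of the form bar(x) + y."""
    x, i = [], 0
    while i + 1 < len(z):
        pair = z[i:i + 2]
        if pair == "01":           # terminator of bar(x): the rest is y
            return "".join(x), z[i + 2:]
        if pair == "10":           # cannot occur inside bar(x)
            break
        x.append(pair[0])          # '00' or '11' is one doubled digit of x
        i += 2
    return "", ""
```

For instance, `encode_pair("101", "0011")` gives `"110011010011"`, and `decode_pair` recovers both components from it. The length of `bar(x)` is `2 * len(x) + 2`, in agreement with (0.3).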

We also consider the space $\Omega$ of infinite binary sequences, denoting them with lower-case Greek letters.
$\Omega^* = \Omega \cup S$ is the set of all finite and infinite sequences. Let $\omega \in \Omega^*$. The $n$-prefix of $\omega$, denoted $(\omega)_n$, is the
string of its first $n$ digits. If $\omega \in S$ with $l(\omega) < n$ then $(\omega)_n = \omega$ by definition. An $\omega \in \Omega$ is a characteristic
sequence for the set $\{n_1, n_2, \ldots\}$ of positive integers if it has 1 at the places $n_1, n_2, \ldots$ and zeroes
everywhere else. Denote by $\Gamma_x$ the set of all sequences (from $\Omega$ or $\Omega^*$, as follows from the context) that have
prefix $x$:

$\Gamma_x = \{\omega : (\omega)_{l(x)} = x\}.$ (0.6)

Notation $x \subset y$ means $\Gamma_x \supseteq \Gamma_y$, that is, the string $x$ is a prefix of $y$. The relation $\subset$ is a partial order on
$S$ (Figure 1).

Functions defined on the Cartesian product $S^n = S \times \ldots \times S$ ($n$ times) are denoted by capital Latin
letters (except some standard functions). Sometimes a superscript $n$ denoting the number of variables is
added: $F^n = F^n(x_1, \ldots, x_n)$. The sentence "for all admissible values of variables $y_1, \ldots, y_m$ there exists a
constant $C$ such that for all admissible values $x_1, \ldots, x_n$

$F^{n+m}(x_1, \ldots, x_n; y_1, \ldots, y_m) \le G^{n+m}(x_1, \ldots, x_n; y_1, \ldots, y_m) + C$" (0.7)

is abbreviated as follows: with parameters $(y_1, \ldots, y_m)$,

$F^{n+m}(x_1, \ldots, x_n; y_1, \ldots, y_m) \preccurlyeq G^{n+m}(x_1, \ldots, x_n; y_1, \ldots, y_m).$ (0.8)

1 More common enumerations of pairs may violate the property (0.11) important below.

The relation $\succcurlyeq$ is defined similarly. $F \asymp G$ means both $F \preccurlyeq G$ and $G \preccurlyeq F$ hold. Obviously, the relations
$\preccurlyeq$, $\succcurlyeq$, $\asymp$ are transitive and

$l(x) \asymp \log_2(x)$ for $x > 0$, (0.9)
$l(\bar{x}) \asymp 2l(x),$ (0.10)
$l(\bar{x}y) \asymp l(y)$ (with $x$ as a parameter). (0.11)

1 Introduction 

1.1 The general construction of complexity 

The topics studied here were introduced in 1964 when A.N.Kolmogorov defined complexity of constructive 
objects. (Similar concepts were independently considered by A.A.Markov and R.J.Solomonoff.) 

A. N. Kolmogorov defines the complexity of a string x for an algorithm A as the least length of binary
strings p encoding x, i.e., such that A(p) = x. The value so defined depends strongly on the choice of A.
The central result that prompted all further investigations was a theorem established by A. N. Kolmogorov
and independently (in slightly different terms) by R. J. Solomonoff. It states the existence of an optimal
algorithm A providing the smallest (compared to any other algorithm B) value of complexity up to an
additive constant $C_B$ (independent of x). Complexity for an arbitrary optimal A is thus sufficiently invariant
to be a fundamental characteristic of x. It found many applications, and quickly generated a rich theory
(cf., for example, a survey [6]).

In the development of this theory, several other quantities similar to complexity (though different from 
it) turned out to be useful. For example, A.A.Markov and D.Loveland considered the decision complexity 
of binary strings, P.Martin-Lof defined their deficiency of "randomness," the present author introduced 
"universal probability," etc. At present, about ten such functions are known. The need exists for some 
organization of this diversity of quantities from a unified standpoint. 

Definition 1. A finitary function is a table defining a function from a finite set $A \subset S$ to $S$. (We assume
it has value $\infty$ on $S \setminus A$.)

Definition 2. A volume restriction is an enumerable family $V$ of finitary functions such that

1. If $f \ge g$ and $g \in V$ then $f \in V$;

2. $\exists C\ \forall f, g \in V\ (C + \min\{f, g\}) \in V$.
We assume for simplicity that $C = 1$.

Definition 3. Let $V$ be a volume restriction. By $V$-majorant, we call any function $F(x)$ such that

1. the set of points over its graph is enumerable and

2. for every finitary function $g$, if $g \ge F$ then $g \in V$.

Theorem 1. For any volume restriction $V$, there exists a $V$-majorant $K_V(x)$ that is smallest (up to an
additive constant), i.e., such that $K_V(x) \preccurlyeq L(x)$ for every $V$-majorant $L(x)$.

Proof. For a finite set $M$ of pairs of numbers, we get a graph of a finitary function by taking the lowest
point of $M$ on each vertical line intersecting $M$. Let us call this function the lower boundary of $M$.
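As a small illustration of the lower boundary used in the proof, here is a sketch in Python. The representation is an assumption made for the example only: a finite set of pairs is a Python set of `(x, a)` tuples, and a finitary function is a dictionary that is taken to be infinite outside its keys. The name `lower_boundary` is hypothetical.

```python
def lower_boundary(pairs):
    """Finitary function whose graph is the lowest point of `pairs` on each vertical line."""
    table = {}
    for x, a in pairs:
        # keep the smallest second coordinate seen so far for this x
        if x not in table or a < table[x]:
            table[x] = a
    return table
```

For example, the lower boundary of `{(0, 3), (0, 1), (2, 5)}` is the table `{0: 1, 2: 5}`.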

Let a partial recursive function $U(i, t)$ enumerate the $i$-th enumerable set $U_i$ of pairs $(x, a)$ for every $i$. Let us
define $U'(i, t)$ enumerating $U'_i \subset U_i$ for each $i$, but slower than $U$. Namely, $U'$ generates the next element
only after verifying that the lower boundary of the set enumerated so far belongs to $V$.

Obviously $U'_i$ is a $V$-majorant for each $i$, and no majorant is "forgotten." Now let $M$ be the set of
pairs situated above pairs $(x, a + Ci)$ where $C$ is the constant in the definition of volume restriction, and
$(x, a) \in U'_i$.

Let us prove that $M$ defines an (obviously optimal) $V$-majorant. In other words, every finitary function
$f$ whose graph is contained in $M$ belongs to the family $V$. By definition of $M$, $f \ge \min_i (g_i + Ci)$ for some
family of functions $g_i \in V$, $i < n$. This implies that $f \in V$. Indeed, let $h_k = \min_{i > k}(g_i + C(i - k))$. Then
$f \ge h_0$, $h_{k-1} = C + \min\{h_k, g_k\}$, and induction on $k$ from $n$ down completes proving the theorem.
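The recurrence at the end of the proof can be checked numerically. The sketch below assumes the auxiliary functions are read as $h_k = \min_{i > k}(g_i + C(i - k))$ with the functions indexed $g_1, \ldots, g_n$ (an interpretation of the garbled source, not a certainty) and verifies $h_{k-1} = C + \min\{h_k, g_k\}$ pointwise. Tables are dictionaries, infinite outside their keys; the names `h` and `recurrence_holds` are hypothetical.

```python
import math


def h(k, gs, C, xs):
    """h_k(x) = min over i > k of g_i(x) + C*(i - k); gs[i-1] holds the table of g_i."""
    n = len(gs)
    return {
        x: min((gs[i - 1].get(x, math.inf) + C * (i - k) for i in range(k + 1, n + 1)),
               default=math.inf)          # empty minimum (k = n) is infinite
        for x in xs
    }


def recurrence_holds(gs, C, xs):
    """Check h_{k-1} = C + min(h_k, g_k) at every point of xs, for k = n, ..., 1."""
    for k in range(len(gs), 0, -1):
        hk, hk_prev, gk = h(k, gs, C, xs), h(k - 1, gs, C, xs), gs[k - 1]
        for x in xs:
            if hk_prev[x] != C + min(hk[x], gk.get(x, math.inf)):
                return False
    return True
```

With `gs = [{0: 3, 1: 0}, {0: 1}, {1: 5, 2: 2}]` and `C = 1`, the check succeeds on the points `0, 1, 2, 3`. Each step of the downward induction applies item 2 of Definition 2 once, which is why membership in $V$ propagates from $h_n$ down to $h_0$.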

For any decidable volume restriction $V$, one can compute a common lower bound $m_V(x) = \min_{f \in V} f(x)$
for all $V$-majorants. It is simpler to study differences $K_V(x) - m_V(x)$ instead of the majorants $K_V(x)$.
These differences will be $V'$-majorants where $f \in V' \Leftrightarrow (f + m_V) \in V$. Obviously $m_{V'}(x) = 0$. We call such
$V'$ "reduced." There is no need to study non-reduced decidable $V$.

Theorem 2. Among reduced volume restrictions, there is one that is most "narrow." The universal
majorant $p(x)$ corresponding to it will be the largest. This restriction is given by the condition $f \in V \Leftrightarrow$

$\sum_x 2^{-f(x)} \le 1.$

Proof. Clearly, $V$ is a reduced volume restriction. If $V'$ is any volume restriction and $f \in V$, then finitely
many applications of item 2 of the definition of volume restrictions yield $f \in V'$. This "extreme" majorant
$p(x)$ turns out to be not far from the complexity $K(x)$ of [9] (which hence is close to the limit).

Theorem 3. $K(x) \preccurlyeq p(x) \preccurlyeq K(x) + 2\log_2 K(x)$.

Proof. $K(x) \preccurlyeq p(x)$ by Theorem 2 (see also Theorem 4a). To prove the second inequality, we show that
any finitary function $f(x) \ge K(x) + 2\log_2 K(x)$ belongs to volume restriction $V$ (from Theorem 2). Indeed,

$\sum_x 2^{-f(x)} = \sum_a \sum_{x : K(x) = a} 2^{-f(x)} \le \sum_a \sum_{x : K(x) = a} 2^{-K(x) - 2\log_2 K(x)} = \sum_a d\{x : K(x) = a\} \cdot \frac{2^{-a}}{a^2}.$

Since $d\{x : K(x) = a\} \le 2^a$, this is $\le \sum_a a^{-2} \preccurlyeq 1$, which completes the proof.

1.2 Examples of majorants 

Definition 4. (A. N. Kolmogorov)

The complexity of $x$ with respect to a p.r. function $F^1$ is

$K_{F^1}(x) \stackrel{\mathrm{def}}{=} \min\, l(p) : F^1(p) = x$, or $\infty$ if there is no such $p$.

We call a word $p$ with $F^1(p) = x$ a code or program for $F^1$ to restore $x$.
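To make Definition 4 concrete, the following sketch computes complexity with respect to a small decoder by exhaustive search. The decoder `toy_F` is entirely hypothetical and chosen to be total, so the search always terminates; for a general partial recursive function such a search need not halt. Programs starting with `0` reproduce the rest of the program literally; programs starting with `1` read the rest as a number `n` (under the string-number correspondence of the notation section) and output `n` zeros.

```python
from itertools import product


def toy_F(p: str):
    """Hypothetical total decoder; undefined (None) only on the empty program."""
    if p.startswith("0"):
        return p[1:]                       # literal mode
    if p.startswith("1"):
        n = int("1" + p[1:], 2) - 1        # number paired with the string p[1:]
        return "0" * n                     # run-of-zeros mode
    return None


def complexity(F, x: str, max_len: int):
    """Least length of a program p with F(p) == x, searching lengths 0..max_len."""
    for n in range(max_len + 1):
        for bits in product("01", repeat=n):
            if F("".join(bits)) == x:
                return n
    return float("inf")                    # no program found within the bound
```

Under this decoder, the word `0000000` has complexity 4 (the program `1000`), while `101` needs the literal program `0101`, also of length 4. The first word is longer but more regular, which is the effect the definition is meant to capture.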

Definition 5. (A. N. Kolmogorov) The conditional complexity of $x$ for known $y$ with respect to a p.r.
function $F^2(p, y)$ is

$K_{F^2}(x \mid y) \stackrel{\mathrm{def}}{=} \min\, l(p) : F^2(p, y) = x$, or $\infty$ if $\forall p\ F^2(p, y) \ne x$.



2 This majorant is a logarithm of the largest (up to a constant) semicomputable probability distribution on natural numbers.

Definition 6. (D. Loveland, A. A. Markov)

The decision complexity of a word $x$ with respect to a p.r. function $F^2$ is

$KR_{F^2}(x) \stackrel{\mathrm{def}}{=} \min\, l(p) : \forall i \le l(x)\ F^2(p, i) = x_i$, or $\infty$ if there is no such $p$;

here $x_i$ is the $i$-th letter of word $x$.
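Decision complexity asks a program to answer "what is the $i$-th letter?" rather than to print the whole word. A minimal sketch, assuming a hypothetical total function `periodic_F2` that treats the program as a repeating pattern (this function is an illustration, not part of the text):

```python
from itertools import product


def periodic_F2(p: str, i: int):
    """Hypothetical F^2: the i-th letter (1-based) of p repeated forever; None for empty p."""
    if not p:
        return None
    return p[(i - 1) % len(p)]


def decision_complexity(F2, x: str, max_len: int):
    """Least length of p with F2(p, i) == x_i for all i <= l(x), searched up to max_len."""
    for n in range(max_len + 1):
        for bits in product("01", repeat=n):
            p = "".join(bits)
            if all(F2(p, i) == x[i - 1] for i in range(1, len(x) + 1)):
                return n
    return float("inf")
```

With respect to this function, `010101` has decision complexity 2 (program `01`) and `0110` has decision complexity 3 (program `011`). Note that the program is not told the length of `x`; it only has to agree with `x` on the positions that exist.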

We have defined three quantities important in complexity theory. Let us show that all three are special cases
of the general concept of $V$-majorant. Then, in particular, Theorem 1 will imply the famous optimality
theorems whose discovery by A. N. Kolmogorov and R. J. Solomonoff started complexity theory.

Let $V_1$ be the set of finitary functions with $< 2^a$ of $x$ having $f(x) < a$.

Let $V_2$ be the set of finitary functions $f(x, y)$ on pairs $(x, y)$ (more precisely, on codes of such pairs) with
every $a$, $y$ having $< 2^a$ of $x$ with $f(x, y) < a$.

Let $V_3$ be the set of finitary functions with $< 2^a$ branches in the tree of words $x$ with $f(x) < a$. It is easy
to verify that $V_1$, $V_2$, $V_3$ are volume restrictions.
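The claim that $V_1$ is a volume restriction can be checked on small tables. The sketch below reads the condition as "for every $a$, fewer than $2^a$ words $x$ have $f(x) < a$" (strict inequalities, as the text appears to say) and represents a finitary function as a dictionary that is infinite outside its keys. The names `in_V1` and `one_plus_min` are hypothetical.

```python
import math


def in_V1(f: dict) -> bool:
    """True if for every a >= 0 fewer than 2**a arguments x have f(x) < a."""
    if not f:
        return True
    # beyond a = max(f)+1 the count stays at len(f) while 2**a keeps growing
    for a in range(max(f.values()) + 2):
        if sum(1 for v in f.values() if v < a) >= 2 ** a:
            return False
    return True


def one_plus_min(f: dict, g: dict) -> dict:
    """The table 1 + min{f, g} from item 2 of Definition 2 with C = 1."""
    return {x: 1 + min(f.get(x, math.inf), g.get(x, math.inf)) for x in set(f) | set(g)}
```

For example, `f = {"a": 0}` and `g = {"b": 0}` both belong to $V_1$, their pointwise minimum `{"a": 0, "b": 0}` does not (two words below level 1), but `one_plus_min(f, g)` does. In general, the words with $1 + \min\{f, g\} < a$ are among those with $f < a - 1$ or $g < a - 1$, so there are fewer than $2 \cdot 2^{a-1} = 2^a$ of them.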

We call classes $A$ and $B$ equivalent if for every function $f$ from one of them there is a function $g \preccurlyeq f$
from the other.

Theorem 4. a) The class of $V_1$-majorants is equivalent to the complexity class with respect to any algorithms.

b) The class of $V_2$-majorants is equivalent to the conditional complexity class with respect to any algorithms.

c) The class of $V_3$-majorants is equivalent to the class of decision complexities.

Proof. We prove Theorem 4a. Theorems 4b and 4c can be proven similarly. It is easy to see that for any
$A$, $K_A(x)$ is a $V_1$-majorant. Conversely, for any $V_1$-majorant $F$ one can enumerate all points $(x, n)$ above its
graph and map them to different $p \in \{0, 1\}^n$, with pairs $(p, x)$ forming the graph of an algorithm $A$. $F \in V_1$
assures enough codes $p$ for that. $K_A(x)$ may exceed $F(x)$ by $\le 1$, QED.
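The converse direction of the proof can be illustrated on a finite table. The sketch below is a simplification and not the construction of the proof itself: since the whole table is known in advance, it assigns a code only to the lowest point $(x, F(x) + 1)$ above the graph of each $x$, whereas the proof has to handle every enumerated point because a majorant is only enumerable from above. The name `build_decoder` is hypothetical, and the $V_1$ condition is read with strict inequalities.

```python
def build_decoder(F: dict) -> dict:
    """Map distinct codes p of length F(x)+1 to the words x of a finite table F in V_1."""
    decoder, used = {}, {}
    for x, v in sorted(F.items()):
        n = v + 1                          # length of the code given to x
        idx = used.get(n, 0)               # next unused code of this length
        if idx >= 2 ** n:
            raise ValueError("table violates the V_1 counting condition")
        used[n] = idx + 1
        decoder[format(idx, "b").zfill(n)] = x
    return decoder
```

For the table `{"x": 0, "y": 1, "z": 1}` this yields three distinct codes of lengths 1, 2, 2, so the complexity with respect to the resulting decoder is exactly $F + 1$ on each word. The counting condition guarantees that the codes of any given length never run out.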

1.3 Invariant functions and complexity 

Complexity has an important property of invariance, namely 

Remark 1. Under any p.r. isomorphism between two recursively enumerable sets, the complexities of their 
elements differ at most by a constant. 

Besides, complexity has "informational correctness," namely 

Remark 2. There exists a computable enumeration of pairs $f(x, y) = i$, $\pi_1(i) = x$, $\pi_2(i) = y$ such that the
complexity of $f(x, y)$ is at least the complexity of $x$ and $y$ up to an additive constant (true even for every
computable enumeration).

Complexity is bounded by a logarithm of its argument, i.e. contains only a very limited amount of 
information about the words. But it turns out that even among functions of arbitrary nature, no "richer" 
invariants exist. 

Theorem 5. Every invariant informationally correct (in the above sense) function F(x) is at most K(x) 
up to a multiplicative constant. The constant cannot be made additive because even changing the alphabet 
changes a constant factor. 

Proof. The algorithm $A$ in $K_A(x)$ can be represented as the composition of the function $\pi_1(x)$ and an
invertible function. Then the theorem's assumptions imply $F(x) \preccurlyeq F(p)$ when $A(p) = x$. It remains to show
$F(p) \le C \cdot l(p)$. This follows from constructing four isomorphisms of natural numbers, combining which we
can, in $n$ steps, obtain every $n$-bit word from 0. QED.

1.4 Computable complexity majorants

Clearly, knowing a word $x$ and its complexity, we can effectively find (at least, by exhaustive search) a shortest
program coding $x$. Moreover, knowing $x$ and any bound $s \ge K(x)$, we can find an $s$-bit program, possibly
not a shortest one. Since complexity isn't a computable function, in practice one has to be content with its
computable majorants giving the length of an effectively computable code, not necessarily the shortest one.
Barzdin', Petri, Kanovich showed all such majorants to be very coarse in some cases. However, we have

Theorem 6. Every "informationally correct" (in the sense of Sec. 1.3) function which is less (up to an
additive constant) than every computable complexity majorant is also less (up to an additive constant) than
the complexity itself.

Proof. Every algorithm can be represented as a composition of an invertible algorithm assuming all
values in a recursive set and the function $\pi_1(x)$. Complexity with respect to an invertible function is just the
logarithm of its inverse and hence it is computable. The theorem follows.

1.5 Decision complexity 

For many reasons, $K(x)$ or $K(x \mid l(x))$ are not quite natural to use for studying the complexity of sequences
(rather than terminated words). Thus A. A. Markov and D. Loveland introduced $KR(x)$, which proved to be
very fruitful. E.g.,

Remark 3. A sequence $\omega$ is computable if and only if $KR((\omega)_n)$ is bounded.

Evident for $KR(x)$, this isn't true for $K(x)$, and for $K(x \mid l(x))$ it isn't evident and remained an open problem
for some time. An affirmative answer was given by the author independently of Kolodiy, Loveland (USA)
and Mishin. This is implied by the following theorem relating $KR(x)$ with $K(x \mid l(x))$.

Theorem 7. For every $\omega$, $KR((\omega)_n)$ is bounded if and only if $K((\omega)_n \mid n)$ is.

Proof. One direction is obvious: a computable $\omega$ has a general recursive function $F(n) = (\omega)_n$.
Let $F^2(p, n) = F(n)$; then $K_{F^2}((\omega)_n \mid n) = l(\Lambda) = 0$ because $F^2(\Lambda, n) = (\omega)_n$, hence $K((\omega)_n \mid n) \preccurlyeq 0$, i.e.,
$K((\omega)_n \mid n) \le C$.

Let us prove the other direction. Suppose $K((\omega)_n \mid n) \le C$. We want to establish the existence of a
procedure which, for given $n$, produces $\omega_n$, the $n$-th digit of $\omega$. Consider all words $p$ with length at most
$C$ and construct a table as shown on the Figure in Sec. 2.2 of [6]: at the $p$-th row of the $n$-th column we place
$F^2_0(p, n)$ (see (1.6) from [6]) provided it halts. The set of all words $F^2_0(p, n)$ in the $n$-th column we denote
$A_n$. Each $A_n$ has at most $2^{C+1}$ words, and $(\omega)_n \in A_n$.

Let $l = \limsup_{n \to \infty} d(A_n)$. Clearly, the set $U = \{n : d(A_n) \ge l\}$ is recursively enumerable and infinite.
Moreover, there are only finitely many $n$ with $d(A_n) > l$; the largest of such $n$ we denote $m_1$. Let $k \le 2^{C+1}$
be the number of sequences $\omega$ with $K((\omega)_n \mid n) \le C$. Let $m_2$ be the smallest length of prefixes distinct for
all these sequences (by the way, all columns starting from the $m_2$-th should contain at least $k$ prefixes of these
sequences, hence $k \le l$). Let $m = \max(m_1, m_2)$.

Let $U'$ be an infinite decidable subset of $U$ and $V = U' \cap \{n : n > m\}$.

The algorithm deciding the $i$-th (in lexicographical order) of our sequences proceeds as follows. To find
its $j$-th digit, we select the smallest $n_r \ge j$ in $V$ and start filling in the $n_r$-th column (with words $F^2_0(p, n_r)$,
$l(p) \le C$). When $l$ words are found, we stop: there are no more. Denote by $B_{n_r}$ the set of all $n_r$-bit words from
$A_{n_r}$. Then we similarly construct the set $B_{n_{r+1}}$ and take from it all words with prefixes from $B_{n_r}$; this set is
denoted $C_{n_{r+1}}$. Then, words from $B_{n_{r+2}}$ with prefixes from $C_{n_{r+1}}$ form the set $C_{n_{r+2}}$; $C_{n_{r+3}}$ is the set of
words from $B_{n_{r+3}}$ with prefixes from $C_{n_{r+2}}$, and so on. We stop when the current set $C_{n_s}$ contains exactly $k$
words: they all are $n_s$-prefixes of sequences with $K((\omega)_n \mid n) \le C$. Selecting the $i$-th lowest of them we take
its $j$-th digit; it is what is required.

3 However, it was shown by Petri that there is no effective way to calculate a bound on $KR((\omega)_n)$ from a bound on
$K((\omega)_n \mid n)$, that is, the former can be very large.

4 Our construction uses numbers $l, k, m$ but isn't effective, giving no procedure to find them. We only prove that the required
algorithm exists (an intuitionist would say: "cannot but exist"), so we only need the mere existence of $l, k, m$.

2 Measures and Processes



This chapter considers deterministic and non-deterministic processes generating sequences. The central 
result is introducing a universal semi-computable measure and establishing its relation with complexity. At 
the end of the chapter, these results are applied to the study of capacities of probabilistic machines. 

2.1 Definitions. Equivalence of measures. 

Definition 7. Algorithmic process or simply process is a partial recursive function $F$ mapping words into
words, and such that if $F(x)$ is defined for a word $x$ and $y \subset x$ then $F(y)$ is also defined and $F(y) \subset F(x)$.
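The monotonicity requirement of Definition 7 can be tested mechanically on all short words. In this sketch a candidate function returns a Python string where it is defined and `None` where it is not; the helper names and the two sample functions are hypothetical, chosen only to show one function that satisfies the condition and one that does not.

```python
from itertools import product


def words(max_len: int):
    """All binary words of length 0..max_len."""
    for n in range(max_len + 1):
        for bits in product("01", repeat=n):
            yield "".join(bits)


def is_process_up_to(F, max_len: int) -> bool:
    """Check: F(x) defined and y a prefix of x imply F(y) defined and a prefix of F(x)."""
    for x in words(max_len):
        fx = F(x)
        if fx is None:
            continue
        for k in range(len(x)):            # all proper prefixes y = x[:k]
            fy = F(x[:k])
            if fy is None or not fx.startswith(fy):
                return False
    return True


def every_other(x: str) -> str:
    """Keeps the digits at even positions; longer inputs only extend the output."""
    return x[::2]


def reverse(x: str) -> str:
    """Not a process: reading one more digit changes the first output digit."""
    return x[::-1]
```

Here `is_process_up_to(every_other, 6)` holds, while `reverse` already fails on the input `01`, whose output `10` does not extend the output `0` of its prefix. A finite check like this can refute the property but cannot prove it for all words.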

Let us apply a process $F$ to all prefixes of $\omega \in \Omega$ while $F$ is defined. It outputs prefixes of a sequence
$\rho \in \Omega^*$. This $\rho$ is the result $F(\omega)$ of applying $F$ to $\omega$ (i.e., $F$ maps $\Omega$ into $\Omega^*$).

Remark 4. There exists a universal process, i.e., a partial recursive function $H$, such that $H(i, x)$ is a
process for all $i$, and for any process $F$ an $i$ exists such that $H(i, x) = F(x)$. Such $H$ is easily constructed
from a universal p.r. function. Without loss of generality we assume (and use later) $H(\Lambda, \Lambda) = \Lambda$.

Processes $F$ and $G$ are said to be equivalent if $F(\omega) = G(\omega)$ for any $\omega \in \Omega$.

Remark 5. Any process has an equivalent one that is primitive recursive.

Definition 8. We say a process $F$ is applicable to $\omega$ if $F(\omega)$ is infinite.

Remark 6. Any process is a continuous function on the set of sequences to which it is applicable (with the 



Definition 9. A process is fast growing (fast applicable to $\omega$) if a monotone unbounded total recursive
function $\varphi(n)$ exists such that for all $x$ (respectively, for all prefixes $x$ of $\omega$) for which $F$ is defined, $\ell(F(x)) \ge \varphi(\ell(x))$.
In this case we say the speed of growth (of the application to $\omega$) of process $F$ is $\ge \varphi(n)$.

Remark 7. One can easily show that a process applicable to all $\omega$ is total recursive and fast growing.
Clearly, the reverse is also true.

Definition 10. Let $P$ be a probability measure over $\Omega$. We say that a process $F$ is $P$-regular if the set of
sequences to which it is applicable has $P$-measure 1.

In order to define an arbitrary measure on the Borel $\sigma$-algebra of subsets of $\Omega$, it suffices to define it on
sets $\Gamma_x$.

Definition 11. A measure $P$ on $\Omega$ is computable if there exist total recursive functions $F(x, n)$ and $G(x, n)$
such that the rational number $\alpha_P(x, n) = F(x, n)/G(x, n)$ is a $2^{-n}$-approximation of $P(\Gamma_x)$.

Remark 8. Obviously then, $\alpha_P(x, n + 1) + 2^{-(n+1)}$ is a $2^{-n}$-approximation of $P(\Gamma_x)$ from above. Hence,
without loss of generality, we always assume $\alpha_P(x, n)$ to be an upper bound, and take $\alpha_P(x, n) - 2^{-n}$ as a
lower bound.

Denote by $L$ the measure $L(\Gamma_x) = 2^{-l(x)}$, and call it the uniform measure. It corresponds to Bernoulli
trials with probability 1/2; it is also the Lebesgue measure on the interval [0, 1]. Obviously, $L$ is computable.
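A short sketch of the uniform measure on the sets $\Gamma_x$, using exact rational arithmetic so that the identities hold without rounding. The function name `uniform` is hypothetical; the measure itself is the $L$ defined above.

```python
from fractions import Fraction
from itertools import product


def uniform(x: str) -> Fraction:
    """L(Gamma_x) = 2^(-l(x)) for a binary word x."""
    return Fraction(1, 2 ** len(x))


# The sets Gamma_x for all words x of a fixed length n partition the whole space.
total = sum(uniform("".join(bits)) for bits in product("01", repeat=5))
```

Here `total` equals 1, and for every word `x` the additivity `uniform(x) == uniform(x + "0") + uniform(x + "1")` holds, as it must for a measure on infinite sequences.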

Theorem 8. a) For any computable measure $P$ and any $P$-regular process $F$ the measure $Q(\Gamma_y) =
P(\bigcup \Gamma_x : F(x) \supset y)$ (i.e., the measure with which the outputs of $F$ are distributed) is computable.

b) For any computable measure $Q$ there exists an $L$-regular process $F$, generating $Q$-distributed outputs
from $L$-distributed inputs. Moreover, $F$ has an inverse $G$ (i.e. $F(G(\omega)) = \omega$ when $G$ is applicable) applicable
to all non-recursive sequences except maybe some in intervals of $Q$-measure 0.

5 If $F((\omega)_n)$ is defined, and for all $m > n$, $F((\omega)_m)$ coincides with $F((\omega)_n)$ or is undefined, then $F(\omega) = F((\omega)_n)$. $F(\omega) = \Lambda$
if $F((\omega)_n)$ is undefined or empty for all $n$.

6 In this topology, $\Omega$ is equivalent to the Cantor perfect set.

7 A somewhat weaker result was independently proven by Mann (USA).

Proof. a) We compute a $2^{-n}$-approximation (from above or below; making it an upper bound is easy)
$\alpha_Q(y, n)$ to $Q(\Gamma_y)$. Choose $m$ such that $P(\{\omega : l(F((\omega)_m)) \ge l(y)\}) > 1 - 2^{-(n+1)}$. (Such an $m$ exists as
process $F$ is $P$-regular; moreover, one can effectively find such an $m$.) Take all words $x \in \{0, 1\}^m$ such that
$y \subset F(x)$, and compute $\alpha_Q(y, n)$ as the sum of $2^{-(m+n+1)}$-approximations to measures $P(\Gamma_x)$ of all these $x$.
Then the error is $|\alpha_Q(y, n) - Q(\Gamma_y)| \le 2^{-(n+1)} + 2^m \cdot 2^{-(m+n+1)} = 2^{-n}$ (as there are $\le 2^m$ of $x$).

b) We consider sequences $\omega \in \Omega$ as reals in [0, 1] (with binary expansions $\omega$; the cases of binary rationals,
where such expansions have ambiguity, will be specially noted). Figure 3 in [6] shows a distribution function
$g$ that corresponds to measure $Q$. As is well known, the random variable $g^{-1}(\xi)$ is $Q$-distributed with $\xi$
uniformly distributed over [0, 1]. Our construction is based on this idea.

I. A process $F((\alpha)_n)$ generates $Q$-distributed $g^{-1}(\alpha)$ from $L$-distributed inputs. It takes upper $2^{-2n}$-approximations
$\alpha_Q(y, 2n)$ of $Q(\Gamma_y)$ for each $y \in \{0, 1\}^n$ and outputs the longest common prefix of those
$z \in \{0, 1\}^n$ for which

$\sum_{y \le z} \alpha_Q(y, 2n) \ge (\alpha)_n \ge 1 - 2^{-n} - \sum_{y \ge z} \alpha_Q(y, 2n).$ (2.1)

II. Due to (2.1), the intervals $\bigcup \Gamma_z$ contain (for each $n$) the $g^{-1}$-image of $\alpha$. Hence, $F(\alpha)$, if applicable, generates
$g^{-1}(\alpha)$ (treating $\gamma$ in Figure 3 of [6] as a pre-image of $\alpha \in [\sigma', \sigma'']$). To prove $F$ is what we need, it suffices to show
that it is $L$-regular.

1) Let $[\sigma', \sigma'']$ correspond to a single $\gamma$ with $Q(\gamma) > 0$. If $\sigma' < \alpha < \sigma''$ then once $\sigma' < (\alpha)_n - 2^{-n} <
(\alpha)_n + 2^{1-n} < \sigma''$, only a single $z$ satisfies (2.1) and so is output. Thus, $F$ is applicable to such $\alpha$, though
not always to the ends $\sigma'$, $\sigma''$.

2) Now let $\alpha$ not be of such types. Then $Q(\bigcup \Gamma_z) \to 0$ as $n \to \infty$. Hence, if $\alpha$ is not of type $\rho$ corresponding
to a measure-0 interval, then $\bigcup \Gamma_z$ shrink to a point $\beta = g^{-1}(\alpha)$; their longest common prefix grows infinitely.

3) A notable case of type $\rho$ is $\alpha = g(\beta)$ with a binary rational $\beta$: its two binary expansions may form a
measure-0 interval mentioned above.

In sum, $F$ is applicable to all sequences except some of types $\rho$, $\sigma'$, $\sigma''$ of Figure 3 in [6]. This set is
clearly countable, so $F$ is $L$-regular.

III. The inverse process $G$ just computes $g$. It may be non-applicable only to (computable by the Corollary
to Theorem 11) $\gamma$ with $Q(\gamma) > 0$, and $\beta$ with binary rational $\alpha = g(\beta)$. If $F(\alpha)$ is applicable, it computes $\beta$.
If not, and $\beta$ is not of the mentioned type $\gamma$, it lies on an interval $[\tau', \tau'')$ of zero $Q$-measure. Q.E.D.

2.2 Semi-computable measures 

Definition 12. A semi-computable (the term is justified by Theorem 9) measure is the distribution of the 
outputs of an arbitrary (not necessarily regular) process on inputs distributed according to a computable 
measure. 

Remark 9. Semi-computable measures are concentrated on $\Omega^*$ since a non-regular process can have finite
outputs with positive probability. In this section, we assume $\Gamma_x$ is the set of all finite and infinite sequences
with prefix $x$.

Remark 10. The distribution of outputs of any process on inputs with an arbitrary semi-computable distribution
is also semi-computable (as a composition of two processes is again a process). Any semi-computable
measure can be obtained from a uniform measure by some process (see Theorem 8b).

Theorem 9. A measure $P$ is semi-computable iff total recursive functions $F$, $G$ exist such that $\beta_P(x, t) =
F(x, t)/G(x, t)$ is a monotone non-decreasing in $t$ function, and

$\lim_{t \to \infty} \beta_P(x, t) = P(\Gamma_x).$ (2.2)

This Theorem implies that the class of semi-computable measures (more accurately, of their logarithms)
is equivalent to the class of $V$-majorants, where $V$ is the set of finitary functions $f$ for which $\sum_{x \in M} 2^{-f(x)} \le 1$
for all sets $M$ whose words are not prefixes of each other.

Proof. Let $P$ be a semi-computable measure. Then there exists a process $F$ generating this measure
from $L$. Let it make $t$ steps on all words $y$ with $\ell(y) \le t$ and, denoting the result by $F_t(y)$ (if no results are
achieved yet then $F_t(y) = \Lambda$), set $\beta_P(x, t) = L(\bigcup \Gamma_y : x \subset F_t(y))$.

Inversely, suppose a measure $P$ has a function $\beta_P(x, t)$ satisfying the terms of the Theorem. We wish to
construct a process $F$ generating $P$ from $L$.

The idea is simple: we need to partition the interval [0, 1] into disjoint subsets of measure $P(\Gamma_x)$, and
to output $x$ when our uniformly distributed input falls into a corresponding set. Now we describe the
construction precisely. Clearly, $P(\Gamma_x) \ge P(\Gamma_{x0}) + P(\Gamma_{x1})$. Moreover, without loss of generality, we assume
$\beta_P(x, t) \ge \beta_P(x0, t) + \beta_P(x1, t)$ for all $t$: each time this fails, we delay growth of $\beta_P(x0, t)$ and $\beta_P(x1, t)$
proportionally to restore the inequality. It is easy to construct subsets of the interval [0, 1] with the following
conditions: to each pair $(x, t)$ there corresponds a union $I_{x,t}$ of a finite number of intervals with binary
rational ends and combined length $\beta_P(x, t)$. Within this procedure, for any words $x \ne y$ of equal length, $I_{x,t_1}$
and $I_{y,t_2}$ are disjoint for all $t_1$ and $t_2$; for any words $x \subset y$ and any $t$, $I_{y,t} \subset I_{x,t}$; for any $t_1 < t_2$ and any $x$,
$I_{x,t_1} \subset I_{x,t_2}$.

Our $F(z)$ constructs $I_{x,t}$ for all $x$, $t$ such that $l(x) \le l(z)$ and $t \le l(z)$, and outputs a longest $x$ such that
$z \in I_{x,t}$ for some $t$. Obviously, such $x$ is unique as the sets corresponding to divergent $x$ are disjoint, and
$x' \subset x''$ for $z' \subset z''$.
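The interval idea behind this construction can be seen in a simplified, fully computable special case. The sketch below is not the construction of the proof: it assumes the target measure is a Bernoulli measure with a known rational probability `p1` of the digit 1, so each word `x` gets a single interval of length exactly $P(\Gamma_x)$ and no approximations $\beta_P(x, t)$ are needed. The names `interval` and `process` are hypothetical.

```python
from fractions import Fraction


def interval(x: str, p1: Fraction):
    """Subinterval [lo, hi) of [0, 1) assigned to x; its length is the Bernoulli(p1) measure of Gamma_x."""
    lo, width = Fraction(0), Fraction(1)
    for c in x:
        if c == "0":
            width *= 1 - p1                # left part goes to the digit 0
        else:
            lo += width * (1 - p1)         # right part goes to the digit 1
            width *= p1
    return lo, lo + width


def process(z: str, p1: Fraction) -> str:
    """Longest x whose interval contains the dyadic interval of the input prefix z (needs 0 < p1 < 1)."""
    zlo = Fraction(int(z, 2) if z else 0, 2 ** len(z))
    zhi = zlo + Fraction(1, 2 ** len(z))
    out = ""
    while True:
        for c in "01":
            lo, hi = interval(out + c, p1)
            if lo <= zlo and zhi <= hi:    # z's interval still fits: extend the output
                out += c
                break
        else:
            return out                     # neither extension fits: stop here
```

With `p1 = Fraction(1, 3)`, the inputs of length 8 whose output begins with `0` number 170 out of 256, which approaches the target probability 2/3 from below as the input length grows. With `p1 = Fraction(1, 2)` the process simply copies its input, as expected for the uniform measure.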

2.3 Universal semi-computable measure 

Theorem 10. There exists a semi-computable measure $R$ that is universal, i.e., such that for any semi-computable
measure $Q$, there is a constant $C$ such that $C \cdot R(\Gamma_x) \ge Q(\Gamma_x)$ for all $x$.

Proof. By a remark in Section 2.1, a universal process $H(i, x)$ exists. Obviously, $F(z) = H(\pi_1(z), \pi_2(z))$
is a process. Applied to uniformly distributed sequences, it generates the desired measure. Indeed, let a
process $G(x)$ ($= H(i, x)$ for some $i$ and all $x$) transform some set of sequences into $\Gamma_x$. Then $F(x)$ transforms
into $\Gamma_x$ these same sequences with added prefix $\bar{i}$ (maybe some others as well). Thus, the measure of $\Gamma_x$
cannot decrease by a factor $> C = 2^{\ell(\bar{i})}$.

Remark 11. This result does not extend to computable measures: no measure is universal among all com- 
putable measures. This is one of the reasons for introducing the notion of a semi-computable measure. 

The measure $R$, being (within a constant factor) "larger" than any other measure, is concentrated on the
widest subset of $\Omega^*$.

The following issue is considered in mathematical statistics: find out what distribution $P$ can randomly
generate a given sequence $\omega$. If we know nothing a priori about $\omega$, then the only (= the weakest) statement
we can make about it is that it can be generated under distribution $R$. In this sense, $R$ reflects our intuition
about "prior probability." The following is of interest:

a) For a constant $C$, the probability (under measure $R$) of having a 1 after $n$ zeros is $\ge \frac{1}{n} \cdot \frac{1}{C \log_2^2 n}$.

b) For every constant $C$, at most a $\frac{1}{C}$ fraction of $n$ on any interval $[0, N]$ has the probability (under $R$) of
a 1 falling after $n$ zeros exceed $\frac{1}{n} \cdot C \log_2 n$.

Thus, $R(0^n 1)$ is typically around $\frac{1}{n}$.

The proof easily follows from Theorem 11, taking into account that the complexity $KR(0^n 1)$ does not
exceed $\log_2 n + c$, and for the majority of these words this complexity is almost equal to $\log_2 n$.

One can see an analogy between constructing the complexity KR and the universal semi-computable 
measure. It turns out these two quantities also have a quantitative connection: 

Theorem 11. $|KR(x) - (-\log_2 R(\Gamma_x))| \preccurlyeq 2 \log_2 KR(x)$.

Proof. Let $KR(x) = i$; thus, for some $p \in \{0, 1\}^i$ and all $n \le \ell(x)$, we have $G^2_0(p, n) = x_n$ (here $G^2_0$ is
from Theorem 2.1 of [6]). Then, one can easily construct a process transforming each sequence with prefix
$\overline{\ell(p)}p$ into a sequence with prefix $x$: this process first separates the prefix $\overline{\ell(p)}$, recovers $\ell(p)$, "reads" $p$, and

8 In other words, Q is absolutely continuous with respect to R with Radon-Nikodym derivative bounded by C from above. 

9 Note that this statement is true only for the universal (prior) probability. For example, if we know that the Sun has risen
for 10,000 years, this does not mean that the probability of the Sun not rising tomorrow is approximately equal to 1/3,650,000.
This statement would be true if the above fact was the only information that we have about the Sun.

sequentially generates $G^2_0(p, n)$ for $n = 1, 2, \ldots$. From a uniformly distributed input, this generates sequences
in $\Gamma_x$ with probability $\ge 2^{-\ell(\overline{\ell(p)}p)}$. Thus, by Theorem 10, $R(\Gamma_x) \ge c \cdot 2^{-\ell(\overline{\ell(p)}p)}$, hence

$-\log_2 R(\Gamma_x) \preccurlyeq \ell(\overline{\ell(p)}p) \asymp \ell(p) + 2\ell(\ell(p)) = i + 2\ell(i) = KR(x) + 2\ell(KR(x)).$

Now, assume $R(\Gamma_x) = q$. Let us denote $\ell(q) = \lfloor -\log_2 q \rfloor$. To estimate the complexity $KR(x)$, we
reconstruct every symbol of $x$ from the triple $\ell(q), k, i$ (i.e., from $\overline{\ell(q)}ki$), where $k \in \{0, 1\}$ and $i \le 2^{\ell(q)+1}$.
Our algorithm works as follows: based on $\ell(q)$, it builds the tree (see Fig. 4 in [6]) of all words $y$ with
$R(\Gamma_y) > 2^{-\ell(q)-1}$. For this, we compute $\beta_R(y, t)$ for more and more values $t$ and $y$, and add $y$ to the tree
when we get $\beta_R(y, t) > 2^{-\ell(q)-1}$ for some $t$.

The word $x$ belongs to this tree. We keep only "maximal" words, i.e., words that are not prefixes of other
words in the current tree. Clearly, the number of such "maximal" words will neither decrease nor exceed
$2^{\ell(q)+1}$. Let A (see Fig. 4 from [6]) be the word from which the last branching from the word $x$ occurs; after
this, $x$ continues without branching. To find $x$, it suffices to have the first digit $k$ of $x$ extending A, and the
number $i$ of maximal words at the moment when the tree being constructed branches at A (incrementing
the number of maximal words to $i$). As $i \le 2^{\ell(q)+1}$, we have $\ell(i) \le \ell(q) + 1$. Thus,

$KR(x) \preccurlyeq \ell(\overline{\ell(q)}ki) \asymp 2\ell(\ell(q)) + \ell(i) \preccurlyeq 2\ell(\ell(q)) + \ell(q) \asymp -\log_2 R(\Gamma_x) + 2\log_2(-\log_2 R(\Gamma_x)).$

But, as proven earlier, $2\log_2(-\log_2 R(\Gamma_x)) \preccurlyeq 2\log_2[KR(x) + 2\ell(KR(x))] \preccurlyeq 2\log_2 KR(x)$, so $KR(x) \preccurlyeq
-\log_2 R(\Gamma_x) + 2\log_2 KR(x)$. The theorem is proved.
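The first half of the proof relies on a process that can tell where the program $p$ ends inside a random input. A minimal sketch of that self-delimiting prefix, under the conventions of the notation section (string-number correspondence and the doubling code with terminator `01`); the function names are hypothetical.

```python
def num_to_str(n: int) -> str:
    """String paired with the number n (0 -> '', 1 -> '0', 2 -> '1', 3 -> '00', ...)."""
    return bin(n + 1)[3:]


def str_to_num(s: str) -> int:
    """Inverse of num_to_str."""
    return int("1" + s, 2) - 1


def bar(x: str) -> str:
    """Doubling code: every digit twice, then the terminator '01'."""
    return "".join(c + c for c in x) + "01"


def self_delimit(p: str) -> str:
    """The prefix bar(l(p)) followed by p, with l(p) written as a string."""
    return bar(num_to_str(len(p))) + p


def read_program(stream: str):
    """Separate bar(l(p)), recover l(p), read p; return (p, unread rest of the stream)."""
    i, digits = 0, []
    while True:
        pair = stream[i:i + 2]
        if pair == "01":
            break
        if pair not in ("00", "11"):
            raise ValueError("input does not start with a self-delimiting prefix")
        digits.append(pair[0])
        i += 2
    n, start = str_to_num("".join(digits)), i + 2
    return stream[start:start + n], stream[start + n:]
```

For `p = "10110"` the prefix is `"11000110110"`, of length $5 + 2 \cdot 2 + 2 = 11$, that is, $\ell(p) + 2\ell(\ell(p)) + 2$. A uniformly random input begins with this prefix with probability $2^{-11}$, which is the source of the bound $R(\Gamma_x) \ge c \cdot 2^{-\ell(\overline{\ell(p)}p)}$ used above.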

Remark 12. It is worth mentioning that the usual measure-theoretic arguments assure that each measure $P$
(not necessarily semi-computable) is almost fully concentrated on the set of all sequences $\omega$ for which $\exists C\ \forall n$
$P((\omega)_n) \ge C \cdot R((\omega)_n)$.

Similarly, for $R$-almost all sequences, the inverse inequality also holds. If $P$ is absolutely continuous with
respect to $R$, then the inverse inequality also holds for $P$-almost all sequences. This implies a statement
similar to Theorem 11 for an arbitrary semi-computable measure $P$ and prefixes of $P$-almost any sequence
(of course, the constants may vary with sequences).

As a corollary, we get the well-known de Leeuw-Moore-Shapiro-Shannon theorem about probabilistic 
machines: 

Corollary. A sequence is computable if and only if some semi-computable measure (hence, also the universal 
measure) is positive on it. 

Proof. By Theorem 11, the measure of all prefixes is larger than a positive number if and only if their
"complexity of solution" $KR$ is uniformly bounded.



2.4 Probabilistic machines 

The above Shannon et al. result is sometimes interpreted as the impossibility for probabilistic machines
to solve problems that are unsolvable deterministically. However, not all problems require constructing a
specific unique object; some allow many solutions, being satisfied by producing any of them. This class
clearly has problems that are unsolvable by deterministic machines but solvable if a machine can use a
random number generator. An example of such problems is: to generate an uncomputable sequence.

We say a problem of constructing a sequence with a property $\Pi$ is solvable on a probabilistic machine if
the universal measure $R$ of the set of all such sequences is positive. The following theorem shows that such
problems can indeed be solved on machines with access to random number generators; moreover, they
can be solved with an arbitrary given reliability, and quite efficiently, i.e., with a small number of calls to the
random number generator. We call functions $f(n)$ and $g(n)$ asymptotically equal, denoted $f(n) \sim g(n)$, if
$\log_2 f(n) \asymp \log_2 g(n)$; in this section, inequalities $\preccurlyeq$, $\succcurlyeq$ are understood in a similar "asymptotic" way.

10 The corresponding concept of a mass problem was formulated in |31| . 

Theorem 12. Let $A \subset \Omega$ with $R(A) > 0$. Then, for every $\varepsilon > 0$, there exists a fast-growing process $F$ (i.e.,
one with $\ell(F(x)) \succeq \ell(x)$) transforming $L$-distributed sequences into sequences in $A$ with probability $> 1 - \varepsilon$.$^{11}$

Clearly, one cannot solve, e.g., the problem of obtaining a very complex sequence by using a process 
which grows faster, since the process cannot increase the complexity of words. If sequences from the desired 
set A have small complexity, then short programs can generate prefixes of these sequences. However, one 
can imagine that some A could make such programs so special that short random inputs cannot be used 
instead, forcing the slow growth of processes solving A. Interestingly, "fast" processes are also possible in all 
such cases. 

Theorem 13. Let $g$ be a monotonic total recursive function. Then the problem of generating a sequence
from a set $A$ is solvable by a random process that grows as $g(n')$ or faster ($n' \sim n$), if and only if there exists
a set$^{12}$ $B \subset A$ such that $R(B) > 0$ and all $\omega \in B$ have $KR((\omega)_{g(n)}) \preceq n$.

Proof. The previous paragraph makes one direction obvious; we prove the other. Let $B \subset A$, $R(B) > 0$, and
$KR((\omega)_{g(n)}) < n + c \log n$ for all $\omega \in B$ and all $n$.

We first construct a semi-computable measure $P$ with $P(B) > 0$ and integer $P((\omega)_{g(n)}) \cdot 2^{t_n}$, for $t_n =
n + \lceil O(1) + (c+4) \log n \rceil$. For that, we round down $R((\omega)_{g(n)})$ (cut proportionally to satisfy $P(x0) + P(x1) \le
P(x)$). Each rounding cuts the measure by $< 2^{-t_n}$. But $KR(x) < n + c \log n$ for $g(n)$-prefixes of $\omega \in B$, hence
by Theorem 11, $R(x) = 2^{-n}/O(n^{c+2})$. Thus, each rounding cuts an $O(1)/n^2$ fraction of their measure, leaving
$R((\omega)_{g(n)})/O(1)$ for $\omega \in B$, and $P(B) > 0$.
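
The estimate of the fraction lost at each rounding can be spelled out as follows. This is only a sketch: $C$ and $a$ are names introduced here for the unnamed $O(1)$ factors, $x$ stands for the $g(n)$-prefix of some $\omega \in B$, and logarithms are taken to base 2.

```latex
\[
  \frac{2^{-t_n}}{R(x)}
  \;\le\; \frac{2^{-n}\, n^{-(c+4)}\, 2^{-O(1)}}{2^{-n} / (C\, n^{c+2})}
  \;=\; \frac{C\, 2^{-O(1)}}{n^{2}}
  \;=\; \frac{a}{n^{2}},
  \qquad
  \prod_{n \ge 1} \Bigl(1 - \frac{a}{n^{2}}\Bigr) > 0
  \quad \text{whenever } a < 1 .
\]
```

Since $\sum_n 1/n^2$ converges, the product of the surviving fractions stays bounded away from zero once the $O(1)$ term in $t_n$ is taken large enough to make $a < 1$; this is what leaves a constant fraction of the measure after all roundings.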

We then construct a process generating $P$ as in the proof of Theorem 9, but select sets corresponding to
pairs $(x, t)$ (where $\ell(x) = g(n)$) to consist of intervals of length a multiple of $2^{-t_n}$. Clearly this process is the
desired one.

Let us describe another result about solvability of standard algorithmic problems on probabilistic machines.
The first interesting result of this type was proven by Janis Barzdin'. An infinite set of natural
numbers is called immune if it does not contain any infinite recursively enumerable subset. 

Proposition. (Barzdin') There exists an immune set for which the problem of constructing a characteristic 
sequence of its infinite subset is solvable by a probabilistic machine. 

The class of all immune sets contains an interesting subclass of hyper-immune sets. For these sets, the 
following result holds. 

Theorem 14. For any hyper-immune set $M$, the problem of constructing a characteristic sequence of its
infinite subset cannot be solved by a probabilistic machine. 

Proof. Assume a machine can solve this problem with a positive probability. Then, by Theorem 12, 
a machine can solve it with probability p > 2/3. Construct a function f(i) computed by the following 
algorithm: run this machine on the tree of sequences until it generates, on measure > |, some sequences 
that have at least i ones; return the maximum of the positions of these ones. Clearly, this function dominates 
the direct enumeration of M. Q.E.D. 
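
The search described in the proof can be sketched on a toy example. Everything named below is an assumption made for illustration: `toy_machine` is a hypothetical stand-in for the probabilistic machine (it maps a finite string of random bits to an output prefix), and the measure threshold is left as a parameter `threshold` whose default value is chosen here only for the demonstration. The sketch reads "the maximum of the positions of these ones" as the largest position of the $i$-th one among the qualifying outputs.

```python
from fractions import Fraction
from itertools import product


def toy_machine(x: str) -> str:
    """Hypothetical monotone machine: echo each random bit, then emit a 1."""
    return "".join(bit + "1" for bit in x)


def position_of_ith_one(s: str, i: int):
    """1-based position of the i-th one in s, or None if s has fewer ones."""
    count = 0
    for pos, ch in enumerate(s, start=1):
        if ch == "1":
            count += 1
            if count == i:
                return pos
    return None


def dominating_f(i, machine=toy_machine, threshold=Fraction(1, 2), max_depth=20):
    """Explore the tree of random inputs level by level until the outputs
    containing at least i ones carry measure > threshold; then return the
    largest position of the i-th one among those outputs."""
    for depth in range(max_depth + 1):
        weight = Fraction(1, 2 ** depth)  # measure of one input of this length
        positions = []
        for bits in product("01", repeat=depth):
            pos = position_of_ith_one(machine("".join(bits)), i)
            if pos is not None:
                positions.append(pos)
        if positions and weight * len(positions) > threshold:
            return max(positions)
    raise RuntimeError("threshold not reached within max_depth")


if __name__ == "__main__":
    print([dominating_f(i) for i in (1, 2, 3)])
```

The function is total and computable whenever the machine produces arbitrarily many ones on a set of inputs of measure above the threshold, which is the property the proof needs from $f$.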

Petri subsequently showed that if M is not fixed, the problem of generating the sequence characteristic 
for a hyper-immune set is solvable on probabilistic machines. However, the following holds: 

Theorem 15. Let us call a set strongly hyper-immune if its direct enumeration dominates, from some place 
on, each computable function. The problem of generating sequences characteristic for strongly hyper-immune 
sets is not solvable on probabilistic machines. 

The proof is similar to the one above. 

11 Note that first, the construction of this process is not always efficient and second, as shown by N. Petri, this process 
sometimes cannot be replaced by a table-based one (i.e., by a total recursive fast-growing process). 
12 This set B can always be selected to be closed. 
13 Also proven by V.N. Agafonov independently of the author. 

3 Information Theory



3.1 Definition and basic properties 

The complexity $K(x)$ denotes, intuitively, the amount of information required for restoring a text $x$. The
conditional complexity $K(x|y)$ is the amount of information needed, in addition to the information in text
$y$, for restoring text $x$. The difference between these two can be naturally called the amount of information
in $y$ about $x$.

Definition 13. (A.N. KOLMOGOROV)

The amount of information in $y$ about $x$ is $I(y : x) = K(x) - K(x|y)$.

Remark 13. (a) $I(x : y) \succeq 0$; (b) $I(x : x) \asymp K(x)$.

Proof. (a) Let $F^2(p, x) = F_0(p)$. Then, if $F_0(p_0) = y$ and $K(y) = \ell(p_0)$, from $F^2(p_0, x) = y$ we conclude:
$K(y|x) \preceq K_{F^2}(y|x) = K(y)$.

(b) Let $F^2(p, x) = x$. Then $F^2(\Lambda, x) = x$, and consequently $K(x|x) \preceq K_{F^2}(x|x) = \ell(\Lambda) = 0$. It remains
to observe that $I(x : x) = K(x) - K(x|x)$.

3.2 Commutativity of information 

Shannon's classical definition of the amount of mutual information in two random variables is commutative,
that is, $J(\xi : \eta) = J(\eta : \xi)$. For Kolmogorov's concept of the amount of information in one text about another,
such a precise equality, in general, will not hold.

Example. Clearly, some words $x$ of each length have $K(x|\ell(x)) \ge \ell(x) - 1$.

By Theorem 4(b), there exist arbitrarily large values of $l$ with $K(l) \ge \ell(l) - 1$. For pairs $x$, $l = \ell(x)$ so chosen,

$I(x : l) = K(l) - K(l|x) \succeq \ell(l)$,

$I(l : x) = K(x) - K(x|l) \preceq l - l = 0$.

Thus, in some cases, I(x : y) and I(y : x) differ by order of the logarithm of the complexities of x, y. 
But A.N. KOLMOGOROV and L. Levin showed independently in 1967 that this is the largest possible 
order of magnitude for this difference. So, disregarding the smaller order quantities, $I(x : y)$ is commutative.
Specifically, A.N. KOLMOGOROV and L. Levin proved the following: 

Theorem 16.

(a) $|I(x : y) - I(y : x)| \preceq 12\,\ell(K(\bar{x}y))$;

(b) $|I(x : y) - [K(x) + K(y) - K(\bar{x}y)]| \preceq 12\,\ell(K(\bar{x}y))$.

Proof. (a) We prove the inequality in one direction:

$I(x : y) \succeq I(y : x) - 12\,\ell(K(\bar{x}y))$. (3.1)

The other follows by swapping $x$ and $y$.

We construct two auxiliary functions. Let the partial recursive function $F^4(n, b, c, x)$ enumerate without
repetitions the words $y$ such that $K(y) \le b$, $K(x|y) \le c$. The existence of such a function follows from [5,
Theorem 0.4] (taking into account the remark). Let $j$ (an uncomputable function of $x, b, c$) be the number of
such words $y$. Function $F^4$ halts for all $n < j$ and only for them. Hence the predicate $\Pi(b, c, d, x)$, asserting
that $j$ defined above is $> 2^d$, is equivalent to halting of $F^4(2^d, b, c, x)$ and so is partial recursive. Similarly,
there exists a function $G^5(m, b, c, d)$ enumerating without repetitions all words $x$ with $\Pi(b, c, d, x)$. Denote
by $i$ (an uncomputable function of $b, c, d$) the number of such $x$. Obviously $G^5(m, b, c, d)$ halts for all $m < i$
and only for them.

14 With more careful estimates, the bound can be tightened. For instance, 12 can be replaced by $5+\varepsilon$. It is not known whether
it can be brought down to 1. 

We now start the proof. Let $a = K(x)$, $b = K(y)$, $c = K(x|y)$. Then

$I(y : x) = a - c$.

With $j$, $d = \ell(j)$ and $i$ so defined, clearly $i \cdot 2^d$ does not exceed the number of pairs $(x', y')$ with $K(y') \le b$,
$K(x'|y') \le c$. That number is $< 2^{b+c+2}$, so

$\ell(i) + \ell(j) \preceq b + c$. (3.2)

Since $F^4(n, b, c, x)$ returns $y$ for some $n < j$,

$K(y|x) \preceq \ell(\bar{b}\bar{c}n) \preceq 2\ell(b) + 2\ell(c) + \ell(j)$. (3.3)

Furthermore, since $G^5(m, b, c, d)$ returns $x$ for $d = \ell(j)$ and some $m < i$,

$a = K(x) \preceq \ell(\bar{b}\bar{c}\bar{d}m) \preceq 2\ell(b) + 2\ell(c) + 2\ell(d) + \ell(i)$. (3.4)

Inequalities (3.2)-(3.4) and the $\ell(K(\bar{x}y))$ bound on each of $\ell(b)$, $\ell(c)$, $\ell(d)$ imply $K(y|x) \preceq b + c - a +
12\,\ell(K(\bar{x}y))$. Claim (a) follows.
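
For readers who want the bookkeeping made explicit, one way to chain the three inequalities is sketched below. It assumes only what the proof states: $a = K(x)$, $b = K(y)$, $c = K(x|y)$, $I(y:x) = a - c$, the identity $I(x:y) = K(y) - K(y|x)$ from Definition 13, and the bound of each of $\ell(b)$, $\ell(c)$, $\ell(d)$ by $\ell(K(\bar{x}y))$.

```latex
\begin{align*}
  \ell(i) &\succeq a - 2\ell(b) - 2\ell(c) - 2\ell(d)
      && \text{by (3.4)} \\
  \ell(j) &\preceq b + c - \ell(i)
           \preceq b + c - a + 2\ell(b) + 2\ell(c) + 2\ell(d)
      && \text{by (3.2)} \\
  K(y|x) &\preceq 2\ell(b) + 2\ell(c) + \ell(j)
          \preceq b + c - a + 4\ell(b) + 4\ell(c) + 2\ell(d)
      && \text{by (3.3)} \\
  I(x:y) &= b - K(y|x)
          \succeq a - c - 12\,\ell(K(\bar{x}y))
          = I(y:x) - 12\,\ell(K(\bar{x}y)).
\end{align*}
```

The last line uses $4\ell(b) + 4\ell(c) + 2\ell(d) \preceq 12\,\ell(K(\bar{x}y))$, which leaves some slack in the constant.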

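(b) Clearly, $K(\bar{x}\bar{x}y) \preceq K(\bar{x}y)$. So, by claim (a), $|I(\bar{x}y : x) - I(x : \bar{x}y)| \preceq 12\,\ell(K(\bar{x}y))$, that is,
$|K(\bar{x}y) - K(\bar{x}y|x) - K(x) + K(x|\bar{x}y)| \preceq 12\,\ell(K(\bar{x}y))$, or

$|[K(\bar{x}y) - K(x) - K(y)] + K(y) - K(\bar{x}y|x) + K(x|\bar{x}y)| \preceq 12\,\ell(K(\bar{x}y))$.

Now claim (b) follows by noting that $K(x|\bar{x}y) \asymp 0$, $K(\bar{x}y|x) \asymp K(y|x)$.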
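
The length bounds in (3.3) and (3.4) rest on the self-delimiting pair code (0.1), under which each encoded component costs twice its length plus two digits, as in (0.3). The following sketch implements that code and the string-number correspondence from the introduction; the function names (`bar`, `split_pair`, `to_string`, `encode_triple`) are hypothetical and chosen only for this illustration.

```python
def to_string(n: int) -> str:
    """String paired with the number n: 0 -> '', 1 -> '0', 2 -> '1', 3 -> '00', ..."""
    return bin(n + 1)[3:]  # binary form of n + 1 without its leading 1


def bar(x: str) -> str:
    """Code (0.1): double every digit of x, then append the terminator 01."""
    return "".join(d + d for d in x) + "01"


def split_pair(z: str):
    """Recover (x, y) from bar(x) + y; return ('', '') if z has no such form."""
    x = []
    i = 0
    while i + 1 < len(z):
        a, b = z[i], z[i + 1]
        i += 2
        if a == b:                # a doubled digit belongs to x
            x.append(a)
        elif (a, b) == ("0", "1"):  # terminator reached: the rest is y
            return "".join(x), z[i:]
        else:                     # the pair 10 never occurs in a valid code
            return "", ""
    return "", ""


def encode_triple(b: int, c: int, n: int) -> str:
    """Encode the numbers b, c, n as bar(b) bar(c) n, as used in (3.3)."""
    return bar(to_string(b)) + bar(to_string(c)) + to_string(n)


if __name__ == "__main__":
    z = encode_triple(5, 8, 3)
    print(z, len(z))
```

With this code the length of the encoded triple is exactly $2\ell(b) + 2\ell(c) + \ell(n) + 4$, which is the source of the right-hand side of (3.3) up to an additive constant.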

3.3 Entropy of arbitrary dynamic systems (stationary stochastic processes) and 
algorithmic amount of information 

A.N. KOLMOGOROV showed that for processes of independent trials the algorithmic amount of information
is asymptotically equal to the classical (probabilistic) one (see [5, Theorem 5.3]). In view of Theorem 16(b),
this follows from the connection between algorithmic complexity and probabilistic entropy.

J.T.Schwartz posed the question of whether a similar fact holds for an arbitrary ergodic stationary 
process (that is, a process for which entropy is defined). We give a positive answer to this question in the 
following theorem. 

Theorem 17. Let $\xi_i$, $i = 1, 2, \ldots$, be an arbitrary ergodic stationary stochastic process with values $\xi_i \in \Omega$,
let $P$ be the measure on its trajectories $\omega \in \Omega^\infty$ that defines this process, and let $H$ be its entropy. Denote
by $\alpha^i_n(\omega)$ the word $(\xi_1)_n (\xi_2)_n \cdots (\xi_i)_n$. Then for $P$-almost all $\omega$

$\lim_{n \to \infty} \lim_{i \to \infty} \frac{K(\alpha^i_n(\omega))}{i} = H.$

Clearly, the ergodicity requirement here is not essential. For non-ergodic processes, instead of their
average entropy $H$, one would take "entropy at a point": a function measurable with respect to the
$\sigma$-algebra of invariant sets, averaging on any such set to its average entropy on the set. This easily follows
from the "decomposition" of arbitrary stationary stochastic processes into ergodic ones.

Returning to the ergodic case, it suffices to prove the theorem for processes with discrete values $(\xi_i)_n$.
The general case will follow by taking the limit on $n$.

Consider the set of $2^n$-ary sequences $\omega$, the trajectories of our stochastic process. Defined on this set is a
transformation $T$ shifting the time by 1 and a $T$-invariant ergodic measure describing the process. Within $k$
time steps, $2^{nk}$ different sequences $X^k_i$ of length $k$ can appear. Clearly, for every $\varepsilon$, a $k$ exists such that

$-\sum_{i \le 2^{nk}} P(X^k_i) \log_2 P(X^k_i) < k \cdot (H + \varepsilon).$

Since $T^k$, as well as $T$, preserves the measure $P$, it follows from the Central Ergodic Theorem (C.E.T.) that
for $P$-almost every sequence $\omega$ there exists, for every $l$, a limit of the frequency of the values of $m$ for which
the sequence $T^{mk+l}(\omega)$ begins with $X^k_i$.

Take any such $\omega$ and denote these limits for it by $P_{il}$. From C.E.T. for $T$ and the ergodicity of $T$, it
follows that almost always $\sum_{l<k} \frac{P_{il}}{k} = P(X^k_i)$. Hence we have here $k$ probability distributions on the finite
set of the $X^k_i$, and their average has entropy $< k(H + \varepsilon)$.

By convexity of entropy, at least one of the summand distributions has entropy $< k(H + \varepsilon)$. So for
our $\omega$, an $l$ exists such that the entropy of the frequencies $P_{il}$ with which the number $m$ satisfies the condition
"$T^{mk+l}(\omega)$ begins with $X^k_i$" is $< k(H + \varepsilon)$. By a theorem of Kolmogorov (see [5, Theorem 5.1]), it follows
that the "unit complexity" of almost all $\omega$ is $\le H$, which gives a "half" of our theorem.
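
The averaging step relies on the fact that the entropy of a mixture is at least the average of the entropies of its components, so some component has entropy no larger than that of the mixture. A minimal numerical sketch of this fact follows; the three distributions on a two-point set are made up for illustration, and the function names are assumptions of this sketch rather than part of the proof.

```python
import math


def entropy(dist):
    """Shannon entropy in bits of a finite probability distribution."""
    return -sum(q * math.log2(q) for q in dist if q > 0)


def mixture(dists):
    """Uniform average of several distributions on the same finite set."""
    k = len(dists)
    return [sum(column) / k for column in zip(*dists)]


if __name__ == "__main__":
    # Illustrative data only: three distributions on a two-point set.
    dists = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]
    h_mix = entropy(mixture(dists))
    h_parts = [entropy(d) for d in dists]
    print(h_mix, h_parts)
    # Some component has entropy no larger than the mixture's entropy.
    print(min(h_parts) <= h_mix)
```

In the proof the components are the $k$ frequency distributions indexed by $l$, and the mixture is the distribution $P(X^k_i)$.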

To prove the unit complexity to be $\ge H$, we use some results from Section 2. Consider the collection $X^k_i$
of values of some realization $\omega$ of the process over the first $k$ time steps and compare four quantities: the
entropy $H$; the logarithm of the probability of that collection divided by $k$, that is, $\frac{-\log P(X^k_i)}{k}$; the logarithm of
its a priori probability (see the definition of $R$) divided by $k$ also, that is, $\frac{-\log R(X^k_i)}{k}$; and the unit complexity
$\frac{K(X^k_i)}{k}$. Their limits as $k \to \infty$ are equal. For the first two quantities, this follows from the Shannon-
McMillan-Breiman theorem; for the last two, from Theorem 11 of this dissertation; and for the two in the
middle, from the last remark of Section 2.3. The theorem is proved.
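
As a rough numerical illustration of the simplest case (independent trials), one can compare the entropy of a Bernoulli source with the per-trial length of a compressed sample. This is only a sketch under strong assumptions: complexity is not computable, so the zlib-compressed length is used here merely as a crude computable upper-bound proxy, and the source parameters, sample size, and function names are all chosen for illustration.

```python
import math
import random
import zlib


def bernoulli_entropy(p: float) -> float:
    """Entropy in bits per trial of independent trials with P(1) = p."""
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def compressed_bits_per_trial(p: float, n: int, seed: int = 0) -> float:
    """Sample n trials, pack 8 trials per byte, and return the
    zlib-compressed size in bits divided by n."""
    rng = random.Random(seed)
    bits = [1 if rng.random() < p else 0 for _ in range(n)]
    data = bytearray()
    for start in range(0, n, 8):
        byte = 0
        for b in bits[start:start + 8]:
            byte = (byte << 1) | b
        data.append(byte)
    return 8 * len(zlib.compress(bytes(data), 9)) / n


if __name__ == "__main__":
    p, n = 0.1, 80_000
    print("entropy per trial:        ", bernoulli_entropy(p))
    print("compressed bits per trial:", compressed_bits_per_trial(p, n))
```

A general-purpose compressor will typically land somewhat above the entropy; the theorem concerns the limit of the true unit complexity, which no practical compressor attains exactly.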